Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Abstract
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.
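As an illustration of the linear-growth assumption described above, the following minimal sketch simulates the Chinese restaurant process, the partition distribution underlying a Dirichlet process mixture model, and reports the fraction of data points that fall in the largest cluster. The function names, the concentration parameter alpha = 1.0, the seed, and the sample sizes are assumptions chosen for illustration; none of them come from the paper.

```python
import random


def crp_cluster_sizes(n, alpha, rng):
    """Sample cluster sizes for n points from a Chinese restaurant process.

    Hypothetical helper for illustration only (not the paper's code).
    """
    sizes = []
    for i in range(n):
        # With i points already seated, the next point opens a new cluster
        # with probability alpha / (i + alpha) and joins existing cluster k
        # with probability sizes[k] / (i + alpha).
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s
            if u < acc:
                sizes[k] += 1
                break
        else:
            # Guard against floating-point rounding at the upper boundary.
            sizes[-1] += 1
    return sizes


def largest_cluster_fraction(n, alpha, seed):
    """Fraction of the n points that belong to the largest cluster."""
    sizes = crp_cluster_sizes(n, alpha, random.Random(seed))
    return max(sizes) / n


if __name__ == "__main__":
    # Assumed settings: alpha = 1.0 and a fixed seed, for illustration only.
    for n in (100, 1000, 10000):
        frac = largest_cluster_fraction(n, alpha=1.0, seed=0)
        print(f"n={n:>6}  largest cluster fraction={frac:.3f}")
```

Under these assumptions, the printed fraction typically does not shrink toward zero as n grows, which is the behavior the abstract calls undesirable for entity resolution. A model with sublinearly growing clusters would instead have each cluster hold a negligible fraction of the data.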
Similar resources
Flexible Models for Microclustering with Application to Entity Resolution
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate....
Probabilistic Size-constrained Microclustering
Microclustering refers to clustering models that produce small clusters or, equivalently, to models where the size of the clusters grows sublinearly with the number of samples. We formulate probabilistic microclustering models by assigning a prior distribution on the size of the clusters, and in particular consider microclustering models with explicit bounds on the size of the clusters. The com...
Simulation of Fabrication toward High Quality Thin Films for Robotic Applications by Ionized Cluster Beam Deposition
The most commonly used method for producing thin films is based on the deposition of atoms or molecules onto a solid surface. One suitable method for producing high-quality metallic, semiconductor, and organic thin films, which are used in electronic, robotic, optical, and optoelectronic devices, is ionized cluster beam deposition (ICBD). Many important factors such as cluster size, cluster...
To Express Required CT-Scan Resolution for Porosity and Saturation Calculations in Terms of Average Grain Sizes
Despite advancements in specifying the 3D internal microstructure of reservoir rocks, identifying some sensitive phenomena is still problematic, particularly due to image resolution limitations. Discretization studies on such CT-scan data have always run into the problem that the original data do not fully describe the real porous media. As an attractive alternative approach, one can recon...
Impact of region of interest size and location in Gafchromic film dosimetry
Introduction: Accurate film dosimetry requires careful consideration of sources of uncertainty. Some of the sources of uncertainty are dependent on the size and location of region of interest (ROI), especially in small fields. Avoiding the penumbra is often a reason for using a small ROI. In contrast, choosing very small ROIs may increase uncertainty due to the reduction of th...